9 research outputs found

    Encryption by using base-n systems with many characters

    It is possible to interpret text as numbers (and vice versa) if one interprets letters and other characters as digits and assumes that they have an inherent, immutable ordering. This is demonstrated by the conventional digit set of the hexadecimal number system, where the letters ABCDEF, in exactly this alphabetical sequence, each stand for a digit and thus a numerical value. In this article, we follow this idea through consistently and include all symbols of the Unicode standard for digital character coding, in their standard ordering. We show how this can be used to form digit sets of different sizes and how a subsequent simple conversion between bases can result in encryption whose output mimics the results of wrong encoding and accidental noise. Unfortunately, because of encoding peculiarities, switching to a higher base does not automatically result in efficient disk space compression. Comment: 12 pages, 6 figures
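The idea in this abstract can be illustrated with a minimal sketch. It is not the article's implementation: the function names (`make_digits`, `to_int`, `from_int`, `convert`), the starting code point, and the choice to skip surrogate code points are all assumptions made for the illustration. The sketch reads a string as a number over one ordered digit set and writes that number out over another.

```python
def make_digits(n, start=0x21):
    """Build a digit set of n characters in Unicode code point order.

    Assumption for this sketch: start at `start` and skip the surrogate
    range, which does not contain encodable characters.
    """
    digits = []
    cp = start
    while len(digits) < n:
        if not 0xD800 <= cp <= 0xDFFF:
            digits.append(chr(cp))
        cp += 1
    return "".join(digits)


def to_int(text, digits):
    """Interpret `text` as a number written in base len(digits)."""
    base = len(digits)
    index = {ch: i for i, ch in enumerate(digits)}
    value = 0
    for ch in text:
        value = value * base + index[ch]
    return value


def from_int(value, digits):
    """Write a non-negative integer using the given digit set."""
    base = len(digits)
    if value == 0:
        return digits[0]
    out = []
    while value:
        value, remainder = divmod(value, base)
        out.append(digits[remainder])
    return "".join(reversed(out))


def convert(text, src_digits, dst_digits):
    """Re-express `text` from one digit set in another.

    Note: leading occurrences of the zero digit (src_digits[0]) are lost,
    exactly as leading zeros are lost in ordinary numbers.
    """
    return from_int(to_int(text, src_digits), dst_digits)


HEX = "0123456789ABCDEF"
print(to_int("FF", HEX))          # the hexadecimal example from the abstract
print(convert("FF", HEX, "01"))   # same value written in base 2
```

With a large target digit set, the converted string is shorter in characters but not necessarily in bytes, since characters with higher code points need more bytes in encodings such as UTF-8.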

    Language classification from bilingual word embedding graphs

    We study the role of the second language in bilingual word embeddings in monolingual semantic evaluation tasks. We find strongly and weakly positive correlations between downstream task performance and the similarity of the second language to the target language. Additionally, we show how bilingual word embeddings can be employed for the task of semantic language classification and that joint semantic spaces vary in meaningful ways across second languages. Our results support the hypothesis that semantic language similarity is influenced by both structural similarity and geography/contact. Comment: To be published at Coling 201

    An open problem in computational stemmatology - a model for contamination

    No full text
    In this contribution, two open problems in computational stemmatology are considered. The first one is contamination, an umbrella term referring to all phenomena of admixture of text variants that result from scribes consulting more than one manuscript, or even their memory, when copying a text. This problem is one of the biggest to date in stemmatology, since it implies an entirely different formal approach to the reconstruction of the copy history of a tradition and, in turn, to the reconstruction of an urtext. Maas (1937) famously stated that there is no remedy against contamination, and Pasquali and Pieraccioni (1952) coined the terms 'open' vs. 'closed' recensions to distinguish contaminated from uncontaminated traditions. We present a graph-theoretical model which formally accommodates traditions with any degree of contamination while maintaining a temporal ordering, and we give combinatorial numbers and formulas for the implied numbers of possible scenarios

    Tools, evaluation and preprocessing for stemmatology

    No full text
    This thesis deals with stemmatology, i.e. primarily the reconstruction of the copy history of documents transmitted in handwriting. The central object of stemmatology is the stemma, a visual representation of the copy history, which in graph-theoretical terms usually takes the form of a tree or a directed acyclic graph, where the nodes represent text witnesses (i.e. the text variants) and the edges stand for individual copying processes. At the centre of the discipline are the question of the author's original (if a single one ever existed) and the question of reconstructing its text. The stemma itself is a means to this main end (Cameron 1987). The original text, increasingly altered by the deviations characteristic of manual copying, has usually not been transmitted directly. The aim of the thesis is to describe semi-automatic stemmatology comprehensively and to advance it through tools and analytical methods. The first part describes the history of computer-assisted stemmatology, including its classical precursors, and culminates in the presentation of a simple tool for the dynamic graphical display of stemmata. An excursus on the guiding philological phenomenon of lectio difficilior discusses its possible psycholinguistic causes in the faster lexical access to high-frequency lexemes. In the second part, the most existential of all stemmatological debates, initiated by Joseph Bédier, is examined with mathematical arguments based on a stemmatic model proposed by Paul Maas in 1937. In this chapter the author also simulates stemmata in order to estimate the potential influence of the distribution of copy frequencies per manuscript. In the next part, the author presents a specially compiled corpus in Persian, which is examined qualitatively together with three of the known artificial corpora (Parzival, Notre Besoin, Heinrichi). Subsequently, the Multi Modal Distance is applied, a method for stemma generation that relies on external data on psycholinguistically determined letter confusion probabilities. In the last part, the author works with minimum spanning trees for stemma generation, conducting, evaluating and discussing a comparative study of four methods of distance matrix generation combined with four methods of stemma generation
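The last step mentioned in this abstract, deriving a stemma candidate from a distance matrix via a minimum spanning tree, can be sketched as follows. This is an illustration under assumptions, not the thesis's method: the abstract does not name the distance measures or the tree algorithms used, so the sketch takes a ready-made symmetric distance matrix and applies Prim's algorithm. The function name `minimum_spanning_tree` and the example distances are hypothetical.

```python
def minimum_spanning_tree(dist):
    """Return the edges of a minimum spanning tree using Prim's algorithm.

    `dist` is a symmetric matrix (list of lists) of pairwise distances
    between text witnesses. Each edge (i, j) links a witness already in
    the tree (i) to the newly attached witness (j).
    """
    n = len(dist)
    in_tree = {0}          # assumption: start from witness 0
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leading out of the current tree.
        _, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Hypothetical distances between three witnesses A, B, C.
witnesses = ["A", "B", "C"]
distances = [
    [0, 1, 4],
    [1, 0, 2],
    [4, 2, 0],
]
for i, j in minimum_spanning_tree(distances):
    print(witnesses[i], "-", witnesses[j])
```

The resulting tree is undirected and has no designated root, so turning it into a stemma still requires a decision about orientation, which this sketch leaves open.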

    A Manual for Web Corpus Crawling of Low Resource Languages

    No full text
    Since the seminal publication of “Web as Corpus” [1], the potential of the web as a source has been firmly established for the creation of both online and offline corpora: noisy vs. clean, balanced vs. convenient, annotated vs. raw, small vs. big are only some of the antonym pairs that can be used to describe the range of corpora that can be and have been created. In our case, in the wake of the project Under Resourced Language Content Finder (URLCoFi), we describe a systematic approach to the compilation of corpora for low (or under) resource(d) languages (LRL) from the web in connection with a free eLearning course funded by studiumdigitale at Goethe University, Frankfurt. Despite the ease of retrieving documents from the web, some characteristics of the digital medium introduce certain difficulties. For instance, if someone were to collect all documents on the web in a certain language, firstly, the collection could only be a snapshot, since the content of the web changes constantly, and secondly, there would be no way to ascertain completeness. In this paper, we show ways to deal with such difficulties in search scenarios for LRLs, presenting experiences from a course on this topic. [1] A. Kilgarriff and G. Grefenstette, “Web as corpus,” in Proceedings of Corpus Linguistics 2001, 2001, pp. 342–344

    A practitioner’s view: a survey and comparison of lemmatization and morphological tagging in German and Latin

    No full text
    The challenge of POS tagging and lemmatization in morphologically rich languages is examined by comparing German and Latin. We start by defining an NLP evaluation roadmap to model the combination of tools and resources guiding our experiments. We focus on what a practitioner can expect when using state-of-the-art solutions. These solutions are then compared with old(er) methods and implementations for coarse-grained POS tagging, as well as fine-grained (morphological) POS tagging (e.g. case, number, mood). We examine to what degree recent advances in tagger development have improved accuracy – and at what cost, in terms of training and processing time. We also conduct in-domain vs. out-of-domain evaluation. Out-of-domain evaluation is particularly pertinent because the distribution of data to be tagged will typically differ from the distribution of data used to train the tagger. Pipeline tagging is then compared with a tagging approach that acknowledges dependencies between inflectional categories. Finally, we evaluate three lemmatization techniques

    Handbook of Stemmatology: History, Methodology, Digital Approaches

    No full text
    Stemmatology studies aspects of textual criticism that use genealogical methods to analyse a set of copies of a text whose autograph is lost. As an art (ars), stemmatology has as its main goal editing such a text, and thus presenting it to the reader, in the most satisfactory way; as a more abstract discipline (scientia), it is interested in the general principles of how texts change in the process of being copied. This handbook provides the first coverage of the entire field: theoretical and practical aspects of traditional and modern digital methods. Thirty-eight experts from all fields involved joined forces to write the book, which covers, in forty-one sections, topics ranging from material aspects of text traditions, through methods of traditional textual criticism, to modern digital methods used in the field. The two final chapters, respectively, provide closer views of how the approach towards texts and textual criticism has developed in some well-defined disciplines of textual scholarship, and compare methods used in other fields dealing with "descent with modification". Illustrations with many practical examples from a wide range of disciplines are provided to render the content more accessible. The intended readership comprises both students of the various fields involved with texts and more advanced scholars
